Add ComputeDomain for running multi-node workloads #225

klueska · 2025-01-09T14:42:31Z

No description provided.

Signed-off-by: Kevin Klues <[email protected]>

For now, just mark one well-known error as permanent. Future commits will abstract this better and mark more errors as permanent. Signed-off-by: Kevin Klues <[email protected]>

Signed-off-by: Kevin Klues <[email protected]>

ArangoGutierrez · 2025-02-19T16:24:55Z

api/nvidia.com/resource/v1beta1/computedomain.go

+// ComputeDomainSpec provides the spec for a ComputeDomain.
+type ComputeDomainSpec struct {
+	NumNodes int                       `json:"numNodes"`
+	Channel  *ComputeDomainChannelSpec `json:"channel"`


Should `channel` be optional, with non-IMEX use cases in mind for the future? I know we are currently focused solely on IMEX support, but if we want to carry the ComputeDomain concept forward, we might face clusters without IMEX (channels).

ArangoGutierrez · 2025-02-19T16:26:37Z

api/nvidia.com/resource/v1beta1/computedomainconfig.go

@@ -0,0 +1,89 @@
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.


Maybe, like some clothing brands do, "Since 1987*"? But I am not a lawyer; maybe the year in the license header has a deeper legal meaning.

ArangoGutierrez · 2025-02-19T16:29:46Z

api/nvidia.com/resource/v1beta1/register.go

@@ -0,0 +1,49 @@
+/*
+ * Copyright (c) 2022, NVIDIA CORPORATION.  All rights reserved.


Is DRA really as old as 2022?


jgehrcke · 2025-02-20T11:40:02Z

templates/compute-domain-daemon.tmpl.yaml

+            tail -f /dev/null & wait
+          fi
+          /usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
+          tail -n +1 -f /var/log/nvidia-imex.log & wait


Discussed in a sync meeting a while ago: here we give up control of the IMEX daemon process (do we? how does errexit behave when a daemonized process exits non-zero?). In any case, for robustness and debuggability it will be good to actively monitor the health of the IMEX daemon process (polling the process, or better: getting a health signal actively and straight from the process). I'd like to look into that at some point, after merge.

This is still a problem. If the daemon crashes, we will not exit (but the liveness probe will eventually fail and the pod will be restarted). We should make it more robust as a follow-up (probably by not doing everything in bash but instead writing a small Go utility).
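As a rough illustration of the "small Go utility" idea, the sketch below starts a child process, waits for it, and reports its exit code so the wrapper can exit when the child does. It assumes the daemon can be kept in the foreground as a direct child, which this thread does not establish for `nvidia-imex`. The name `superviseForeground` is hypothetical, and `sh -c "exit 3"` only stands in for the real daemon command.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// superviseForeground starts the command as a child process, blocks until it
// exits, and returns its exit code. A non-nil error means the process could
// not be started or waited on, not that it exited non-zero.
func superviseForeground(name string, args ...string) (int, error) {
	cmd := exec.Command(name, args...)
	// Forward the child's output so it ends up in the container logs.
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		return -1, fmt.Errorf("starting %s: %w", name, err)
	}
	err := cmd.Wait()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// The process ran and exited non-zero (ExitCode is -1 if it was killed by a signal).
		return exitErr.ExitCode(), nil
	}
	if err != nil {
		return -1, err
	}
	return 0, nil
}

func main() {
	// Stand-in command; a real wrapper would run the daemon binary here and
	// pass the returned code to os.Exit so the container stops with it.
	code, err := superviseForeground("sh", "-c", "exit 3")
	fmt.Println(code, err)
}
```

With this shape, a crash of the child is visible to the kubelet immediately through the container exit, instead of only through a later liveness probe failure.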

klueska force-pushed the add-multi-node-crd branch 10 times, most recently from 38065b4 to 3e51cd8 Compare January 14, 2025 08:15

klueska force-pushed the add-multi-node-crd branch 20 times, most recently from 4ce9bdb to 0d435d8 Compare January 22, 2025 15:56

klueska added 17 commits February 19, 2025 10:28

Limit Deployment's ResourceClaimTemplate to driver namespace

381ee05

Signed-off-by: Kevin Klues <[email protected]>

Make the ResourceClaimTemplateManager specific to targeting daemons

d052490

Signed-off-by: Kevin Klues <[email protected]>

Move to ResourceClaimTemplates instead of global ResourceClaims

b082ab0

Signed-off-by: Kevin Klues <[email protected]>

Remove ImmediateMode and make Delayed the only option

45e0177

Signed-off-by: Kevin Klues <[email protected]>

Add waiting for dependent objects of ComputeDomain to be fully removed

75edd9b

Signed-off-by: Kevin Klues <[email protected]>

Use a daemonset instead of a deployment to run ComputeDomain daemons

7821db4

Signed-off-by: Kevin Klues <[email protected]>

Block ComputeDomain deletion while a workload is still running in it

e9c7101

Signed-off-by: Kevin Klues <[email protected]>

Add a liveness probe to the ComputeDomain daemon

9cbc576

Signed-off-by: Kevin Klues <[email protected]>

Ensure that ResourceClaim / ComputeDomain namespace are the same

3373134

Signed-off-by: Kevin Klues <[email protected]>

Add the notion of a "permanent" error to the kubelet plugin

993b853

For now, just mark one well-known error as permanent. Future commits will abstract this better and mark more errors as permanent. Signed-off-by: Kevin Klues <[email protected]>

Harden logic around calling prepare / unprepare on allocated claims

7783596

Signed-off-by: Kevin Klues <[email protected]>

Abstract out getConfigResultsMap so it can be reused later

0e5611f

Signed-off-by: Kevin Klues <[email protected]>

Unconditionally unprepare imex channels and daemons

ef25561

Signed-off-by: Kevin Klues <[email protected]>

Treat a ClusterUUID of all 0s to mean no IMEX support as well

a37de39

Signed-off-by: Kevin Klues <[email protected]>
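A minimal sketch of the check this commit title describes, assuming the cluster UUID is available as a dash-separated string and that an empty value is the other "no IMEX" case. Both are assumptions, and `isZeroUUID` and `imexSupported` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// isZeroUUID reports whether s contains only '0' characters once dashes are removed.
func isZeroUUID(s string) bool {
	digits := strings.ReplaceAll(s, "-", "")
	return digits != "" && strings.Trim(digits, "0") == ""
}

// imexSupported treats both an empty and an all-zero cluster UUID as "no IMEX support".
func imexSupported(clusterUUID string) bool {
	return clusterUUID != "" && !isZeroUUID(clusterUUID)
}

func main() {
	fmt.Println(imexSupported("00000000-0000-0000-0000-000000000000"))
	fmt.Println(imexSupported("1b2c3d4e-0000-0000-0000-000000000001"))
}
```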

Add a level of indirection with a new 'channel' field in ComputeDomain

4625953

Signed-off-by: Kevin Klues <[email protected]>

Ensure that the fabric-imex-mgmt nvcap is created and injected always

718e69d

Signed-off-by: Kevin Klues <[email protected]>

Recursively unmount /proc/driver/nvidia if it is mounted

5a83bac

Signed-off-by: Kevin Klues <[email protected]>

klueska force-pushed the add-multi-node-crd branch from 1e6a587 to a45c238 Compare February 19, 2025 10:28

klueska added 4 commits February 19, 2025 15:36

Add demo specs for working with compute domains

3ea7913

Signed-off-by: Kevin Klues <[email protected]>

Only inject channel / daemon settings if running on an IMEX capable node

578ab87

Signed-off-by: Kevin Klues <[email protected]>

Add periodic cleanup of stale objects owned by deleted ComputeDomains

4464442

Signed-off-by: Kevin Klues <[email protected]>

Allow the DRA driver for GPUs to be force installed if desired

222df11

Signed-off-by: Kevin Klues <[email protected]>

klueska force-pushed the add-multi-node-crd branch 2 times, most recently from df90001 to 764c3c7 Compare February 19, 2025 15:43

ArangoGutierrez reviewed Feb 19, 2025


Determine cliqueID from NVML not node label

474f968

Signed-off-by: Kevin Klues <[email protected]>

klueska force-pushed the add-multi-node-crd branch from 764c3c7 to 474f968 Compare February 19, 2025 17:00

jgehrcke reviewed Feb 20, 2025


klueska merged commit d1fad7e into NVIDIA:main Feb 20, 2025
4 checks passed

jgehrcke mentioned this pull request Mar 22, 2025

IMEX daemon startup error not caught, CD marked as ready #289

Open
